FIGURE 5.1
Performance of quantized BERT with varying weight bit-widths and 8-bit activation on
MRPC and MNLI-m.
5.2 Fully Quantized Transformer for Machine Translation
Prato et al. introduce FullyQT, an all-inclusive quantization strategy for the Transformer. It is also the first work to show that it is possible to avoid any loss in translation quality with a fully quantized Transformer [190]. Their method comprises four parts: the quantization scheme, the choice of quantized layers, tensor bucketing, and a unique design for zeros.
5.2.1 Quantization Scheme
The quantization scheme is uniform, meaning that the step size between two consecutive quantized values is constant. Uniformity is an additional constraint, but it was chosen for practical reasons: it simplifies all computations required during inference, allowing hardware resources to be exploited more efficiently. Given an element $x$ of a tensor $X$, the uniform quantization scheme is defined as
$$Q(x) = \left\lfloor \frac{\operatorname{clamp}(x;\, x_{\min}, x_{\max}) - x_{\min}}{s} \right\rceil, \tag{5.7}$$
where $x_{\min}$ and $x_{\max}$ define the endpoints of the quantization interval. The clamp function maps all values outside the $[x_{\min}, x_{\max}]$ range to the closest endpoint, and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer.
The step size $s$ is computed as
$$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \tag{5.8}$$
where b is simply the bit precision.
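To make the scheme concrete, the following is a minimal NumPy sketch of Equations (5.7) and (5.8). The function names and the dequantization helper are illustrative assumptions, not Prato et al.'s implementation.

```python
import numpy as np

def step_size(x_min: float, x_max: float, bits: int) -> float:
    """Step size s from Eq. (5.8): the range split into 2^b - 1 intervals."""
    return (x_max - x_min) / (2 ** bits - 1)

def quantize(x: np.ndarray, x_min: float, x_max: float, bits: int = 8) -> np.ndarray:
    """Uniform quantization Q(x) from Eq. (5.7): clamp, shift, scale, round."""
    s = step_size(x_min, x_max, bits)
    clamped = np.clip(x, x_min, x_max)          # clamp(x; x_min, x_max)
    return np.rint((clamped - x_min) / s)       # round to the nearest integer

def dequantize(q: np.ndarray, x_min: float, x_max: float, bits: int = 8) -> np.ndarray:
    """Map quantized integers back to the real-valued grid (illustrative helper)."""
    s = step_size(x_min, x_max, bits)
    return q * s + x_min

# Example: quantize a weight tensor with x_min = min(X) and x_max = max(X)
X = np.random.randn(4, 4).astype(np.float32)
q = quantize(X, X.min(), X.max(), bits=8)       # integers in [0, 255]
X_hat = dequantize(q, X.min(), X.max(), bits=8) # reconstruction error at most s/2
```

With $b = 8$, the quantized values are integers in $[0, 2^b - 1]$, and any value inside the clamping range is reconstructed to within half a step size.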
When quantization is applied to weights, $x_{\min}$ and $x_{\max}$ are $\min(X)$ and $\max(X)$, respectively. However, when quantization is applied to activations, those values are running